import pandas as pd
import numpy as np
import sklearn
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)

print("Pandas:", pd.__version__)
print("NumPy:", np.__version__)
print("sklearn:", sklearn.__version__)

url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv"
df = pd.read_csv(url, encoding="ISO-8859-1")
df.head()
Pandas: 2.3.2
NumPy: 2.3.3
sklearn: 1.7.2
| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
| 1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
| 2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
| 3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
| 4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |

5 rows × 38 columns
Elevator pitch
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS. THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.
QUESTION|TASK 1
Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
The original dataset contained long survey question text and many “Unnamed” auto-generated column labels. I cleaned the column names using a mapping dictionary to shorten the labels and make them easier to use for modeling. Below is a sample of the renaming applied:
| Original Column Name | Clean Name |
| --- | --- |
| Have you seen any of the 6 films… | seen_any |
| Do you consider yourself a fan of Star Wars | fan_starwars |
| Which films have you seen… | seen_films |
| Unnamed: 4 → Unnamed: 8 | seen_ep1 → seen_ep5 |
| Ranking question | rank_ep1 → rank_ep6 |
| Character favorability | char_luke → char_mace |
| Gender | gender |
| Age | age_range |
| Household Income | income_range |
| Education | education |
| Location (Census Region) | location |
# Map the long survey questions and auto-generated "Unnamed" labels to short names.
rename_map = {
    "RespondentID": "respondent_id",
    "Have you seen any of the 6 films in the Star Wars franchise?": "seen_any",
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "fan_starwars",
    # Films seen. NOTE: row 0 of the raw data shows that this first column holds
    # Episode I and "Unnamed: 4" through "Unnamed: 8" hold Episodes II-VI, so the
    # seen_ep1..seen_ep5 names below are one episode behind the data they label.
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_films",
    "Unnamed: 4": "seen_ep1",
    "Unnamed: 5": "seen_ep2",
    "Unnamed: 6": "seen_ep3",
    "Unnamed: 7": "seen_ep4",
    "Unnamed: 8": "seen_ep5",
    # Film rankings (1 = favorite, 6 = least favorite).
    "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "rank_ep1",
    "Unnamed: 10": "rank_ep2",
    "Unnamed: 11": "rank_ep3",
    "Unnamed: 12": "rank_ep4",
    "Unnamed: 13": "rank_ep5",
    "Unnamed: 14": "rank_ep6",
    # Character favorability. NOTE: row 0 of the raw data labels "Unnamed: 28"
    # as Yoda, so verify these suffixes against row 0 before relying on them.
    "Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.": "char_luke",
    "Unnamed: 16": "char_han",
    "Unnamed: 17": "char_leia",
    "Unnamed: 18": "char_anakin",
    "Unnamed: 19": "char_obiwan",
    "Unnamed: 20": "char_emperor",
    "Unnamed: 21": "char_darthmaul",
    "Unnamed: 22": "char_yoda",
    "Unnamed: 23": "char_boba",
    "Unnamed: 24": "char_jabba",
    "Unnamed: 25": "char_padme",
    "Unnamed: 26": "char_jarjar",
    "Unnamed: 27": "char_palpatine",
    "Unnamed: 28": "char_mace",
    # Other opinion questions.
    "Which character shot first?": "shot_first",
    "Are you familiar with the Expanded Universe?": "know_eu",
    # NOTE: the raw header for this question ends with a stray character, so
    # this key does not match and the column keeps its original name.
    "Do you consider yourself to be a fan of the Expanded Universe?": "fan_eu",
    "Do you consider yourself to be a fan of the Star Trek franchise?": "fan_startrek",
    # Demographics.
    "Gender": "gender",
    "Age": "age_range",
    "Household Income": "income_range",
    "Education": "education",
    "Location (Census Region)": "location",
}
df = df.rename(columns=rename_map)
df.head()
| | respondent_id | seen_any | fan_starwars | seen_films | seen_ep1 | seen_ep2 | seen_ep3 | seen_ep4 | seen_ep5 | rank_ep1 | ... | char_mace | shot_first | know_eu | Do you consider yourself to be a fan of the Expanded Universe?ξ | fan_startrek | gender | age_range | income_range | education | location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
| 1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
| 2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
| 3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
| 4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |

5 rows × 38 columns
QUESTION|TASK 2
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
a. Filter the dataset to respondents that have seen at least one film
b. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
c. Create a new column that converts the education groupings to a single number. Drop the school categorical column
d. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
e. Create your target (also known as “y” or “label”) column based on the new income range column
f. One-hot encode all remaining categorical columns
Step 2.1 — Filter to respondents who have seen at least one film
To prepare for prediction, we first remove survey respondents who answered “No” to the question about seeing Star Wars films. Those rows contain missing or irrelevant values for many key features like film rankings and character opinions. Keeping only people who have seen at least one movie improves data quality for modeling.
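A minimal sketch of this filter, assuming the renamed `seen_any` column from Task 1. The `sample` frame and its values are made up for illustration and are not survey results.

```python
import pandas as pd

# Hypothetical stand-in for the renamed survey data (illustrative values only).
sample = pd.DataFrame({
    "respondent_id": [None, 1, 2, 3],
    "seen_any": ["Response", "Yes", "No", "Yes"],
    "rank_ep1": ["Star Wars: Episode I The Phantom Menace", "3", None, "1"],
})

# Keep only respondents who reported seeing at least one film.
# The same test also drops the "Response" sub-header row stored as row 0.
seen = sample[sample["seen_any"] == "Yes"].copy()

print(seen["seen_any"].unique())
print(len(seen))
```

Filtering on `"Yes"` rather than excluding `"No"` is a deliberate choice in this sketch: it removes the sub-header row and any missing answers in the same step.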
Step 2.2 — Convert age ranges to numeric values
The age_range column uses text groups such as “18-29”.
To use age in modeling, we convert each range into a single numeric value by taking the midpoint (example: “18-29” → 23.5). Then we drop the original text column.
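The midpoint conversion could look like the sketch below. Only the "18-29" label and its 23.5 midpoint come from this report; the other bracket labels and the value chosen for the open-ended bracket are assumptions, and `age_num` is a hypothetical column name.

```python
import pandas as pd

# Assumed bracket labels. The open-ended "> 60" bracket has no upper bound,
# so the value used for it here is an arbitrary placeholder.
AGE_MIDPOINTS = {
    "18-29": 23.5,   # (18 + 29) / 2
    "30-44": 37.0,   # (30 + 44) / 2
    "45-60": 52.5,   # (45 + 60) / 2
    "> 60": 65.0,    # placeholder, not a true midpoint
}

sample = pd.DataFrame({"age_range": ["18-29", "45-60", None, "> 60"]})

# Labels missing from the dictionary (including NaN) map to NaN.
sample["age_num"] = sample["age_range"].map(AGE_MIDPOINTS)
sample = sample.drop(columns=["age_range"])

print(sample)
```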
Step 2.3 — Convert education levels to numeric values
The education column stores categories such as “High school degree” and “Bachelor degree.”
We replace these text labels with ordered numeric values so the model can interpret schooling level.
# Step 2.3: Convert education levels to a numeric scale
education_map = {
    "Less than high school degree": 1,
    "High school degree": 2,
    "Some college or Associate degree": 3,
    "Bachelor degree": 4,
    "Graduate degree": 5,
}
df['edu_num'] = df['education'].map(education_map)
df = df.drop(columns=['education'])
df[['edu_num']].head()
| | edu_num |
| --- | --- |
| 1 | 2.0 |
| 3 | 2.0 |
| 4 | 3.0 |
| 5 | 3.0 |
| 6 | 4.0 |
Step 2.4 — Convert income ranges to numeric values
The income_range column uses dollar ranges like “$50,000 - $99,999.”
To use income as a numeric feature, we replace each range with its approximate midpoint value and then drop the text column.
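One way to compute the midpoints without hard-coding every bracket is to parse the dollar amounts out of each label, as sketched below. The helper `income_midpoint` and the column name `income_num` are hypothetical, and the handling of an open-ended label is an assumption.

```python
import re
import pandas as pd

def income_midpoint(label):
    """Return the midpoint of a dollar range such as "$50,000 - $99,999"."""
    if not isinstance(label, str):
        return float("nan")  # missing answers stay missing
    # Pull out each run of digits and commas, then convert it to a number.
    bounds = [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*", label)]
    if not bounds:
        return float("nan")
    # Assumption: a label with a single number (an open-ended bracket)
    # falls back to that number, since it has no upper bound to average.
    return sum(bounds) / len(bounds)

sample = pd.DataFrame(
    {"income_range": ["$0 - $24,999", "$50,000 - $99,999", "$100,000 - $149,999", None]}
)
sample["income_num"] = sample["income_range"].apply(income_midpoint)
sample = sample.drop(columns=["income_range"])

print(sample)
```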
Step 2.5 — Create the target column for prediction
Our machine learning model will predict whether someone earns more than $50,000 per year.
We create a binary target column where 1 = income > $50k and 0 = income ≤ $50k.
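The target rule can be sketched as follows, assuming a numeric `income_num` column (a hypothetical name) that holds the midpoints. Dropping unlabeled rows first is an assumption of this sketch, not a step confirmed by the report.

```python
import pandas as pd

# Illustrative midpoints, including one respondent with no reported income.
sample = pd.DataFrame({"income_num": [12499.5, 37499.5, 74999.5, 124999.5, None]})

# Rows with no reported income cannot be labeled. Drop them first;
# otherwise the comparison below would silently label them 0.
labeled = sample.dropna(subset=["income_num"]).copy()

# 1 = income above $50k, 0 = income at or below $50k.
labeled["target"] = (labeled["income_num"] > 50_000).astype(int)

print(labeled)
```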
Step 2.6 — One-hot encode the remaining categorical columns
Machine learning algorithms require numeric input.
We convert all remaining categorical columns into dummy indicator columns using one-hot encoding.
This produces our final modeling dataset.
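A short sketch of one-hot encoding with `pd.get_dummies`. The sample frame is illustrative, and the columns passed to `columns=` are assumptions about which categorical columns remain at this point.

```python
import pandas as pd

# Illustrative rows: one numeric column and two categorical columns.
sample = pd.DataFrame({
    "age_num": [23.5, 37.0, 52.5],
    "gender": ["Male", "Female", "Male"],
    "shot_first": ["Han", "Greedo", "I don't understand this question"],
})

# Each listed column is replaced by one 0/1 indicator column per category;
# numeric columns such as age_num pass through unchanged.
encoded = pd.get_dummies(sample, columns=["gender", "shot_first"], dtype=int)

print(encoded.columns.tolist())
```

Passing `dtype=int` yields 0/1 integers rather than booleans, which keeps the whole feature matrix numeric.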
QUESTION|TASK 3
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
Visual #1 — Average ranking of Star Wars movies
FiveThirtyEight reported that Episode V: The Empire Strikes Back is the most liked movie overall.
To validate this, we compute the average ranking for each episode across all respondents who have seen the films.
Lower ranking numbers mean higher preference.
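The averaging step might be sketched like this, assuming the `rank_ep1` to `rank_ep6` columns from Task 1. The three rows of rankings are invented for illustration and are not survey results.

```python
import pandas as pd

rank_cols = [f"rank_ep{i}" for i in range(1, 7)]

# Hypothetical rankings (1 = favorite, 6 = least favorite), stored as text
# the way a CSV with a sub-header row would load them.
sample = pd.DataFrame(
    [["4", "5", "6", "2", "1", "3"],
     ["5", "6", "4", "1", "2", "3"],
     ["4", "6", "5", "3", "1", "2"]],
    columns=rank_cols,
)

# Coerce text to numbers (non-numeric cells become NaN), then average.
ranks = sample[rank_cols].apply(pd.to_numeric, errors="coerce")
avg_rank = ranks.mean().sort_values()  # lowest mean rank = most preferred

print(avg_rank)
```

The sorted series can then be turned into a small DataFrame and passed to a bar chart.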
Visual #2 — Which character shot first
The FiveThirtyEight article highlights the fandom debate over whether Han Solo or Greedo shot first.
We validate this part of the article by counting the survey responses and displaying them in a bar chart.
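Counting the responses can be sketched with `value_counts`, assuming the `shot_first` column from Task 1. The answers below are invented for illustration.

```python
import pandas as pd

# Hypothetical answers, including one respondent who skipped the question.
sample = pd.Series(
    ["Han", "Greedo", "I don't understand this question", "Han", None],
    name="shot_first",
)

counts = sample.value_counts()                   # NaN is excluded by default
shares = (counts / counts.sum() * 100).round(1)  # percent of answered responses

print(counts)
print(shares)
```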
QUESTION|TASK 4
Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
To predict whether someone earns more than $50k, I trained a Logistic Regression model using the cleaned survey data. The dataset was split into a training set (80%) and test set (20%). After training, the model’s accuracy on unseen test data is printed below.
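The training setup can be sketched as below on synthetic data. The features, the label rule, the scaling step, the stratified split, and the random seed are all assumptions made so the example runs on its own; none of them reproduce the report's data or accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-ins for cleaned features (not survey data).
X = np.column_stack([
    rng.choice([23.5, 37.0, 52.5, 65.0], size=n),  # age midpoint
    rng.integers(1, 6, size=n),                    # education level 1-5
    rng.integers(0, 2, size=n),                    # a one-hot indicator
])
# Synthetic label loosely tied to education, plus noise.
y = (X[:, 1] + rng.normal(0, 1.5, size=n) > 3).astype(int)

# 80% train / 20% test, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling helps the solver converge when features have different ranges.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```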
The logistic regression model reached 100% accuracy on the held-out test set, meaning it classified every test respondent's income category (above or below $50K) correctly. This score reflects target leakage rather than predictive power: the income-midpoint feature engineered in Step 2.4 is the same quantity the target is derived from, so the model only has to learn the $50K threshold on that one column. The result therefore does not show that demographic and fandom-related responses predict income. An honest estimate requires dropping the income-midpoint column from the features and re-evaluating.
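Because the target is derived from the income midpoint, a quick check like the sketch below can confirm the overlap and keep that column out of the feature matrix. The column names `income_num`, `age_num`, and `target` are hypothetical, and the rows are illustrative.

```python
import pandas as pd

# Illustrative rows: the target is computed directly from income_num.
sample = pd.DataFrame({
    "income_num": [12499.5, 74999.5, 37499.5, 124999.5],
    "age_num": [23.5, 37.0, 52.5, 23.5],
})
sample["target"] = (sample["income_num"] > 50_000).astype(int)

# If thresholding a feature reproduces the target exactly, that feature
# leaks the label and must not be used as a predictor.
leaks = (sample["income_num"] > 50_000).astype(int).equals(sample["target"])
print("income_num leaks the target:", leaks)

# Build the feature matrix without the target or its source column.
X = sample.drop(columns=["target", "income_num"])
y = sample["target"]
print(X.columns.tolist())
```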